The ongoing battle to improve air quality in the United States highlights how crucial clean air is for everyone's health, now and in the future. Air quality is not just a problem for places with obvious pollution; even countries with strong education and healthcare systems can struggle with bad air, and its effects reach beyond health into safety, politics, and money. Because information and problems spread fast in a connected world, clean air is everybody's concern. Making the air better for everyone is tough because many factors are involved, but like any hard problem, it helps to break it into smaller parts and to start by measuring the effects. Among the many things we could measure, the air quality index is a simple, shared way to understand how clean or polluted the air is and how it affects health.
The Air Quality Index (AQI) is a system for measuring the quality of the air we breathe, providing a standardized and accessible measure of air pollution's potential impact on human health. The AQI takes into account multiple air pollutants that can affect the respiratory and cardiovascular systems, such as ground-level ozone, particulate matter, sulfur dioxide, nitrogen dioxide, and carbon monoxide. Each pollutant is assigned a numerical value, which is then converted into an index value on a scale that typically ranges from 0 to 500, depending on the specific AQI system being used. The AQI scale is divided into categories, typically ranging from Good to Hazardous.
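The Good-to-Hazardous categories can be expressed as a small lookup. A minimal sketch using the standard U.S. EPA category breakpoints (the function name is ours):

```python
def aqi_category(aqi):
    """Return the EPA category name for an AQI value."""
    breakpoints = [
        (50, "Good"),
        (100, "Moderate"),
        (150, "Unhealthy for Sensitive Groups"),
        (200, "Unhealthy"),
        (300, "Very Unhealthy"),
        (500, "Hazardous"),
    ]
    for upper, name in breakpoints:
        if aqi <= upper:
            return name
    return "Hazardous"  # values above 500 are off the standard scale

print(aqi_category(42))   # Good
print(aqi_category(135))  # Unhealthy for Sensitive Groups
```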
The calculation of the AQI involves several steps. First, measured pollutant concentrations are collected from monitoring stations. These concentrations are then compared to health-based breakpoints set by environmental agencies, and each pollutant's concentration is translated into a sub-index value through a standard formula. The highest sub-index among the pollutants is reported as the AQI, and the pollutant responsible is called the "dominant" pollutant. The resulting AQI value is matched to a corresponding category and color code, indicating the potential health implications associated with the current air quality. AQI values can fluctuate throughout the day based on changes in pollutant concentrations, meteorological conditions, and human activities.
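The formula mentioned above is a piecewise linear interpolation between breakpoints. A sketch for PM2.5; the 24-hour breakpoints below are illustrative (pre-2024 values), so check the current EPA tables before relying on them:

```python
# (C_lo, C_hi, I_lo, I_hi): concentration range and matching index range
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 500.4, 301, 500),
]

def pm25_to_aqi(conc):
    """Linearly interpolate a PM2.5 concentration (µg/m³) into an AQI value."""
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    raise ValueError("concentration outside AQI breakpoints")

print(pm25_to_aqi(35.4))  # 100: the top of the Moderate range
```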
The AQI's significance lies in its ability to provide a clear and easily understandable representation of air quality to the public, empowering individuals to make informed decisions about outdoor activities and potential health risks. Furthermore, it serves as a valuable tool for policymakers and regulatory agencies to identify trends, develop effective pollution control strategies, and communicate the urgency of addressing air quality concerns. In short, the AQI bridges the gap between complex air quality data and the general population.
Lung cancer is a malignant tumor that develops in the lungs. It is one of the most common cancers worldwide and is primarily caused by long-term exposure to harmful substances, with smoking being the leading risk factor. However, air quality can also play a significant role in its development: pollutants inhaled into the lungs cause chronic inflammation and DNA damage, which can eventually lead to cancerous cells. Studies have shown that areas with poor air quality have higher incidences of lung cancer, so understanding and analyzing air quality is very important.
Bronchitis and COPD are both chronic respiratory conditions that affect the airways and lung function. Bronchitis is characterized by inflammation of the bronchial tubes. COPD is a broader term that encompasses chronic bronchitis and emphysema. Emphysema involves the destruction of lung tissue and the enlargement of air spaces, leading to reduced lung elasticity and impaired airflow. Both conditions are strongly influenced by air quality.
The primary objective of this tutorial is to delve into the realm of air quality trends across the United States, employing visualization techniques to offer a clearer understanding of how air quality has evolved over time. By harnessing the power of data visualization, we aim to unravel intricate patterns and fluctuations in air quality measurements, thereby providing an insightful portrayal of the changing landscape of pollutants in the atmosphere. Through the meticulous examination of historical air quality data, we seek to identify long-term trends, seasonal variations, and potential hotspots of concern.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
import statsmodels.formula.api as smf
from urllib.request import urlopen
import json
import plotly.express as px
All of the air quality data comes from the Environmental Protection Agency (EPA), a regulatory agency within the United States government responsible for safeguarding and promoting environmental health and sustainability. The EPA's mission revolves around formulating and enforcing regulations aimed at curbing pollution, mitigating climate change, and preserving natural resources.
All of the cancer data comes from the Centers for Disease Control and Prevention (CDC). The CDC's mission is to protect the nation's health and safety by preventing and controlling the spread of diseases, injuries, and other health threats. The CDC is actively involved in cancer surveillance, prevention, and control. Through initiatives like the National Program of Cancer Registries (NPCR) and the Surveillance, Epidemiology, and End Results (SEER) program, the CDC collects and analyzes data on cancer incidence, prevalence, and outcomes. This information is essential for understanding cancer trends, identifying risk factors, and guiding targeted interventions.
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
Read the data from the CSV files and store it in a DataFrame
air_data = pd.DataFrame()
for i in range(20):
    temp_db = pd.read_csv('annual_aqi_by_county_' + str(2001 + i) + '.csv')
    air_data = pd.concat([air_data, temp_db], ignore_index=True)
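As an aside, concatenating inside the loop copies the growing frame on every iteration; the usual pandas idiom is to collect the pieces in a list and concatenate once. A sketch with in-memory frames standing in for the yearly CSV files:

```python
import pandas as pd

# Stand-ins for pd.read_csv('annual_aqi_by_county_<year>.csv')
yearly_frames = [pd.DataFrame({'Year': [2001 + i], 'Median AQI': [40 - i]})
                 for i in range(20)]

# One concat at the end instead of one per iteration
air_data = pd.concat(yearly_frames, ignore_index=True)
print(len(air_data))  # 20
```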
# remove unwanted characters (parentheses) from county names
air_data['County'] = air_data['County'].str.replace(r'[()]', '', regex=True)
# drop rows outside the 50 states and D.C.
air_data = air_data.drop(air_data[air_data['State'] == "Canada"].index)
air_data = air_data.drop(air_data[air_data['State'] == "Virgin Islands"].index)
air_data = air_data.drop(air_data[air_data['State'] == "Puerto Rico"].index)
air_data = air_data.drop(air_data[air_data['State'] == "Country Of Mexico"].index)
air_data.reset_index(drop=True, inplace=True)
# get fips data
# This will be helpful later for mapping purposes
fips_state_df = pd.read_csv('state_fips.csv')
fips_county_df = pd.read_csv('county_fips.csv')
# rename columns to match the air data
new_column_names = {'code': 'state_code','name': 'State'}
fips_state_df.rename(columns=new_column_names, inplace=True)
# merge fips_county_df and fips_state_df to get state and county codes
fips = pd.merge(fips_county_df, fips_state_df, on='state_code', how='left')
# clean the fips data
fips = fips.drop(columns=['state_code', 'county_code'])
fips['name'] = fips['name'].str.replace(r'County', '', regex=True)
fips['name'] = fips['name'].str.replace(r'Borough', '', regex=True)
fips['name'] = fips['name'].str.replace(r'Parish', '', regex=True)
fips.rename(columns={'name': 'County'}, inplace=True)
fips['County'] = fips['County'].str.strip()
fips['code'] = fips['code'].astype(str).str.zfill(5)
# merge fips data with air data to match fips code with county data
air_data_with_fips = pd.merge(air_data, fips, on=['County', 'State'], how='inner')
# note: state codes below 10 need a leading 0, handled by zfill above
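Note that an inner join silently drops counties whose names fail to match between the two tables. Before committing to it, pandas' `indicator` option can audit what would be lost; a sketch on toy frames whose column names mirror the real ones:

```python
import pandas as pd

air = pd.DataFrame({'State': ['Alabama', 'Alabama'],
                    'County': ['Shelby', 'Bibb'],
                    'Median AQI': [40, 35]})
fips_codes = pd.DataFrame({'State': ['Alabama'],
                           'County': ['Shelby'],
                           'code': ['01117']})

# A left join with indicator=True tags each row with its match status
audit = pd.merge(air, fips_codes, on=['County', 'State'], how='left', indicator=True)
unmatched = audit[audit['_merge'] == 'left_only']
print(unmatched[['State', 'County']])  # rows an inner join would drop
```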
Separate the data for later use
# separate air data by state
air_data_by_state = dict()
for state_name, group_df in air_data.groupby('State'):
    air_data_by_state[state_name] = group_df
# separate air data by year
air_data_by_year = dict()
for year, group_df in air_data.groupby('Year'):
    air_data_by_year[year] = group_df
# separate air data by state and county
air_data_by_state_county = dict()
for (s, c), group_df in air_data.groupby(['State', 'County']):
    air_data_by_state_county[(s, c)] = group_df
# separate air data by year and state
air_data_by_year_state = dict()
for (y, s), group_df in air_data.groupby(['Year', 'State']):
    air_data_by_year_state[(y, s)] = group_df
cancer_data = pd.DataFrame()
for i in range(46):
    temp_db = pd.read_csv('CountyMap (' + str(1 + i) + ').csv')
    cancer_data = pd.concat([cancer_data, temp_db], ignore_index=True)
# remove rows where the case count was suppressed ("Data not presented")
cancer_data = cancer_data.drop(cancer_data[cancer_data['Case Count'] == "Data not presented"].index)
cancer_data.reset_index(drop=True, inplace=True)
cancer_data = cancer_data.drop('Year', axis=1)
cancer_data = cancer_data.drop('Cancer Type', axis=1)
cancer_data = cancer_data.drop('Sex', axis=1)
# remove the word "County" from the County column
cancer_data['County'] = cancer_data['County'].str.replace(r'County', '', regex=True)
cancer_data.rename(columns={'Area': 'State', 'Cases': 'Num Cases'}, inplace=True)
cancer_data['County'] = cancer_data['County'].str.rstrip()
cancer_data['Case Count'] = cancer_data['Case Count'].astype(float)
cancer_data['Population'] = cancer_data['Population'].astype(float)
cancer_data['Age-Adjusted Rate'] = cancer_data['Age-Adjusted Rate'].astype(float)
cancer_data
| | State | County | Age-Adjusted Rate | Case Count | Population |
|---|---|---|---|---|---|
| 0 | Alabama | Perry | 31.1 | 21.0 | 45518.0 |
| 1 | Alabama | Washington | 41.5 | 52.0 | 81682.0 |
| 2 | Alabama | Choctaw | 41.9 | 48.0 | 63777.0 |
| 3 | Alabama | Shelby | 45.4 | 592.0 | 1080873.0 |
| 4 | Alabama | Lee | 45.6 | 361.0 | 817306.0 |
| ... | ... | ... | ... | ... | ... |
| 2466 | Wyoming | Johnson | 46.5 | 34.0 | 42590.0 |
| 2467 | Wyoming | Campbell | 49.7 | 106.0 | 234790.0 |
| 2468 | Wyoming | Natrona | 52.2 | 251.0 | 400336.0 |
| 2469 | Wyoming | Crook | 54.6 | 35.0 | 37508.0 |
| 2470 | Wyoming | Weston | 78.0 | 42.0 | 34708.0 |
2471 rows × 5 columns
Separate the data for later use
# split cancer data into dataframes by state
cancer_data_by_state = dict()
for state_name, group_df in cancer_data.groupby('State'):
    cancer_data_by_state[state_name] = group_df
cancer_data_by_state['Alaska']['Case Count'].sum()
1783.0
# split cancer data into dataframes by county
cancer_data_by_state_county = dict()
for (s, c), group_df in cancer_data.groupby(['State', 'County']):
    cancer_data_by_state_county[(s, c)] = group_df
county_copd_data = pd.read_csv('County_COPD_prevalence.csv')
Clean up the data
county_copd_data.rename(columns={'StateDesc': 'State'}, inplace=True)
county_copd_data['LocationID'] = county_copd_data['LocationID'].astype(str).str.zfill(5)
years = []
upper_values = []
median_values = []
good_days = []
# Loop over each year's DataFrame
for year in air_data_by_year:
    # Mean of the AQI summary columns across counties
    upper_aqi = air_data_by_year[year]['90th Percentile AQI'].mean()
    median_aqi = air_data_by_year[year]['Median AQI'].mean()
    percent_good_days = air_data_by_year[year]['Good Days'].mean() / air_data_by_year[year]['Days with AQI'].mean()
    # Append the year and the mean values to the lists
    years.append(year)
    upper_values.append(upper_aqi)
    median_values.append(median_aqi)
    good_days.append(percent_good_days)
plt.figure(figsize=(10, 6))
plt.plot(years, upper_values)
plt.xlabel('Year')
plt.ylabel('Mean 90th Percentile AQI')
plt.title('Mean 90th Percentile AQI by Year')
plt.xticks(years)
plt.tight_layout()
# Show the plot
plt.show()
plt.figure(figsize=(10, 6))
plt.plot(years, median_values)
plt.xlabel('Year')
plt.ylabel('Median AQI')
plt.title('Median AQI by Year')
plt.xticks(years)
plt.tight_layout()
plt.show()
plt.figure(figsize=(10, 6))
plt.plot(years, good_days)
plt.xlabel('Year')
plt.ylabel('Fraction of Days')
plt.title('Percentage of Good Days by Year')
plt.xticks(years)
plt.tight_layout()
plt.show()
The graphs above show the positive trend in air quality in the United States: over the last 20 years there was a consistent decline in both the median Air Quality Index (AQI) and the 90th percentile AQI. The last graph shows 'Good Days' as a fraction of the days observed.
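To attach a number to the downward trend, one could fit a line to the yearly means with `scipy.stats.linregress`, which is already imported above. A sketch on synthetic values shaped like the plotted series; in practice, substitute the `years` and `median_values` lists computed earlier:

```python
import numpy as np
from scipy.stats import linregress

# Synthetic stand-in: a series declining ~0.6 AQI points/year with noise
years = list(range(2001, 2021))
rng = np.random.default_rng(0)
median_values = [48 - 0.6 * i + rng.normal(0, 0.5) for i in range(20)]

result = linregress(years, median_values)
print(f"slope: {result.slope:.2f} AQI points/year, p = {result.pvalue:.2g}")
```

A negative slope with a small p-value quantifies the improvement the plots suggest visually.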
states = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware',
'District Of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa',
'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota',
'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico',
'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island',
'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',
'West Virginia', 'Wisconsin', 'Wyoming']
years = list(range(2001, 2021))
plt.figure(figsize=(12, 8))
plt.xticks(years)
for state in states:
    arr = []
    for year in years:
        arr.append(air_data_by_year_state[(year, state)]['Median AQI'].mean())
    plt.plot(years, arr)
plt.xlabel('Year')
plt.ylabel('Mean Median AQI')
plt.title('Mean Median AQI by State')
plt.show()
The graph above shows the median AQI over 20 years for every state in the United States. To get these lines, the air quality data was averaged over the counties in each state. This is not necessarily the best way to judge whether a state has a high or low AQI, because the size and population of a county can skew the data. But for telling whether air quality is getting better or worse, a simple average is a good indicator.
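The skew mentioned above could be addressed by weighting counties by population. The AQI files do not include a population column, so the `Population` column in this sketch is hypothetical (it could, for instance, be merged in from the cancer data):

```python
import numpy as np
import pandas as pd

# Toy state: one large county with worse air and two small, cleaner ones.
# 'Population' is a hypothetical column, not present in the AQI files.
counties = pd.DataFrame({'Median AQI': [60, 30, 28],
                         'Population': [2_000_000, 50_000, 40_000]})

simple = counties['Median AQI'].mean()
weighted = np.average(counties['Median AQI'], weights=counties['Population'])
print(f"simple mean: {simple:.1f}, population-weighted mean: {weighted:.1f}")
```

The weighted mean tracks the large county, showing how much a simple county average can understate the air most residents actually breathe.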
plt.figure(figsize=(14, 10))
plt.xticks(rotation=90)
# States with cancer data available
categories = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Iowa', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
values = []
for s in categories:
    values.append(cancer_data_by_state[s]['Age-Adjusted Rate'].sum() / len(cancer_data_by_state[s]))
# Sort states by rate and draw the bar graph
sorted_data = sorted(zip(values, categories), reverse=True)
sorted_values, sorted_labels = zip(*sorted_data)
plt.bar(sorted_labels, sorted_values, color='tab:blue')
# Labels and title
plt.xlabel('States')
plt.ylabel('Cases per 100,000 people')
plt.title('Age-Adjusted Cancer Rate by State')
plt.show()
plt.figure(figsize=(14, 10))
plt.xticks(rotation=90)
categories = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Iowa', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
values = []
for s in categories:
    values.append(cancer_data_by_state[s]['Case Count'].sum() / cancer_data_by_state[s]['Population'].sum())
# Sort states by rate and draw the bar graph
sorted_data = sorted(zip(values, categories), reverse=True)
sorted_values, sorted_labels = zip(*sorted_data)
plt.bar(sorted_labels, sorted_values)
# Labels and title
plt.xlabel('States')
plt.ylabel('Cases per person')
plt.title('Crude Cancer Rate by State')
plt.show()
plt.figure(figsize=(14, 10))
plt.xticks(rotation=90)
# split COPD data into dataframes by state
# (the column was renamed from 'StateDesc' to 'State' during cleaning)
copd_data_by_state = dict()
for state_name, group_df in county_copd_data.groupby('State'):
    copd_data_by_state[state_name] = group_df
categories = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'District of Columbia', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Mexico', 'New York', 'North Carolina', 'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania', 'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington', 'West Virginia', 'Wisconsin', 'Wyoming']
values = []
for s in categories:
    values.append(copd_data_by_state[s]['Percent_COPD'].mean())
# Sort states by prevalence and draw the bar graph
sorted_data = sorted(zip(values, categories), reverse=True)
sorted_values, sorted_labels = zip(*sorted_data)
plt.bar(sorted_labels, sorted_values)
# Labels and title
plt.xlabel('States')
plt.ylabel('Percent COPD')
plt.title('Mean COPD Prevalence by State')
plt.show()
First, merge the data with FIPS codes so each datapoint has a location
air_data_with_fips = pd.merge(air_data, fips, on=['County', 'State'], how='inner')
cancer_data_with_fips = pd.merge(cancer_data, fips, on=['County', 'State'], how='inner')
cancer_data_with_fips
| | State | County | Age-Adjusted Rate | Case Count | Population | code |
|---|---|---|---|---|---|---|
| 0 | Alabama | Perry | 31.1 | 21.0 | 45518.0 | 01105 |
| 1 | Alabama | Washington | 41.5 | 52.0 | 81682.0 | 01129 |
| 2 | Alabama | Choctaw | 41.9 | 48.0 | 63777.0 | 01023 |
| 3 | Alabama | Shelby | 45.4 | 592.0 | 1080873.0 | 01117 |
| 4 | Alabama | Lee | 45.6 | 361.0 | 817306.0 | 01081 |
| ... | ... | ... | ... | ... | ... | ... |
| 2384 | Wyoming | Johnson | 46.5 | 34.0 | 42590.0 | 56019 |
| 2385 | Wyoming | Campbell | 49.7 | 106.0 | 234790.0 | 56005 |
| 2386 | Wyoming | Natrona | 52.2 | 251.0 | 400336.0 | 56025 |
| 2387 | Wyoming | Crook | 54.6 | 35.0 | 37508.0 | 56011 |
| 2388 | Wyoming | Weston | 78.0 | 42.0 | 34708.0 | 56045 |
2389 rows × 6 columns
Slice data so it can be graphed
air_data_by_year_fips = dict()
for year, group_df in air_data_with_fips.groupby('Year'):
    air_data_by_year_fips[year] = group_df
Get map data
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)
fig = px.choropleth(air_data_by_year_fips[2005], geojson=counties, locations='code', color='Median AQI',
                    color_continuous_scale="Reds",
                    range_color=(0, 110),
                    scope="usa"
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
fig = px.choropleth(cancer_data_with_fips, geojson=counties, locations='code', color='Age-Adjusted Rate',
                    color_continuous_scale="Reds",
                    range_color=(0, 200),
                    scope="usa"
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
This map shows the age-adjusted cancer rate for each county with available data.
fig = px.choropleth(county_copd_data, geojson=counties, locations='LocationID', color='Percent_COPD',
                    color_continuous_scale="Reds",
                    range_color=(0, 15),
                    scope="usa"
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
merged_df = pd.merge(air_data_by_year[2005], cancer_data, on=['State', 'County'], how='inner')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
merged_df['Age-Adjusted Rate'] = merged_df['Age-Adjusted Rate'].astype(float)
merged_df['90th Percentile AQI'] = merged_df['90th Percentile AQI'].astype(float)
merged_df.columns = merged_df.columns.str.replace(' ', '_')
merged_df.columns = merged_df.columns.str.replace('-', '_')
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(merged_df['90th_Percentile_AQI'], merged_df['Age_Adjusted_Rate'])
slope, intercept, r_value, p_value, std_err = linregress(merged_df['90th_Percentile_AQI'], merged_df['Age_Adjusted_Rate'])
x_vals = merged_df['90th_Percentile_AQI']
y_vals_regression = slope * x_vals + intercept
ax.plot(x_vals, y_vals_regression, color='red', label='Regression Line')
ax.text(0.7, 0.1, f'Slope: {slope:.2f}', transform=ax.transAxes, fontsize=12, bbox=dict(facecolor='white', edgecolor='gray', boxstyle='round,pad=0.5'))
ax.set_xlabel('90th Percentile AQI')
ax.set_ylabel('Age-Adjusted Rate')
fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(merged_df['Median_AQI'], merged_df['Age_Adjusted_Rate'])
slope, intercept, r_value, p_value, std_err = linregress(merged_df['Median_AQI'], merged_df['Age_Adjusted_Rate'])
x_vals = merged_df['Median_AQI']
y_vals_regression = slope * x_vals + intercept
ax.plot(x_vals, y_vals_regression, color='red', label='Regression Line')
ax.text(0.7, 0.1, f'Slope: {slope:.2f}', transform=ax.transAxes, fontsize=12, bbox=dict(facecolor='white', edgecolor='gray', boxstyle='round,pad=0.5'))
ax.set_xlabel('Median AQI')
ax.set_ylabel('Age-Adjusted Rate')
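Beyond the slope, standard errors, confidence intervals, and R² help judge how much to trust the fit. The `statsmodels` formula API imported at the top can fit the same regression; a sketch on synthetic data that reuses the column names of `merged_df` (the coefficients here are made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for merged_df: rate rises ~0.3 per AQI point, plus noise
rng = np.random.default_rng(1)
df = pd.DataFrame({'Median_AQI': rng.uniform(20, 60, 100)})
df['Age_Adjusted_Rate'] = 40 + 0.3 * df['Median_AQI'] + rng.normal(0, 5, 100)

# Formula syntax mirrors the cleaned column names (spaces/dashes -> underscores)
model = smf.ols('Age_Adjusted_Rate ~ Median_AQI', data=df).fit()
print(model.params['Median_AQI'], model.rsquared)
```

`model.summary()` would additionally print p-values and confidence intervals, which `linregress` does not report for the intercept.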
# Years to plot
years = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]
# Create a grid of subplots (one row per year, two columns)
num_rows = len(years)
num_cols = 2
fig, axs = plt.subplots(num_rows, num_cols, figsize=(12, 32))
for row, year in enumerate(years):
    # Merge that year's air data with the cancer data
    merged_df = pd.merge(air_data_by_year[year], cancer_data, on=['State', 'County'], how='inner')
    merged_df['Age-Adjusted Rate'] = merged_df['Age-Adjusted Rate'].astype(float)
    merged_df['90th Percentile AQI'] = merged_df['90th Percentile AQI'].astype(float)
    merged_df.columns = merged_df.columns.str.replace(' ', '_')
    merged_df.columns = merged_df.columns.str.replace('-', '_')
    # First subplot: Median AQI vs. age-adjusted rate
    ax = axs[row, 0]
    ax.scatter(merged_df['Median_AQI'], merged_df['Age_Adjusted_Rate'])
    slope, intercept, r_value, p_value, std_err = linregress(merged_df['Median_AQI'], merged_df['Age_Adjusted_Rate'])
    x_vals = merged_df['Median_AQI']
    y_vals_regression = slope * x_vals + intercept
    ax.plot(x_vals, y_vals_regression, color='red', label='Regression Line')
    ax.text(0.7, 0.1, f'Slope: {slope:.2f}', transform=ax.transAxes, fontsize=12, bbox=dict(facecolor='white', edgecolor='gray', boxstyle='round,pad=0.5'))
    ax.set_xlabel('Median AQI')
    ax.set_ylabel('Age-Adjusted Rate')
    ax.set_title(f'{year} - Scatter Plot with Regression Line')
    # Second subplot: 90th percentile AQI vs. age-adjusted rate
    ax = axs[row, 1]
    ax.scatter(merged_df['90th_Percentile_AQI'], merged_df['Age_Adjusted_Rate'])
    slope, intercept, r_value, p_value, std_err = linregress(merged_df['90th_Percentile_AQI'], merged_df['Age_Adjusted_Rate'])
    x_vals = merged_df['90th_Percentile_AQI']
    y_vals_regression = slope * x_vals + intercept
    ax.plot(x_vals, y_vals_regression, color='red', label='Regression Line')
    ax.text(0.7, 0.1, f'Slope: {slope:.2f}', transform=ax.transAxes, fontsize=12, bbox=dict(facecolor='white', edgecolor='gray', boxstyle='round,pad=0.5'))
    ax.set_xlabel('90th Percentile AQI')
    ax.set_ylabel('Age-Adjusted Rate')
    ax.set_title(f'{year} - Scatter Plot with Regression Line')
plt.tight_layout()  # Adjust spacing between subplots
plt.show()
years = [2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010]
for year in years:
    merged_df_copd = pd.merge(air_data_by_year[year], county_copd_data, on=['State', 'County'], how='inner')
    # Data cleaning: remove rows with missing values before fitting
    merged_df_copd = merged_df_copd.dropna(subset=['Median AQI', 'Percent_COPD'])
    plt.figure()  # Create a new figure per year
    ax = plt.gca()  # Get the current axes
    ax.scatter(merged_df_copd['Median AQI'], merged_df_copd['Percent_COPD'])
    slope, intercept, r_value, p_value, std_err = linregress(merged_df_copd['Median AQI'], merged_df_copd['Percent_COPD'])
    # Skip the regression line if the fit failed after cleaning
    if not pd.isnull(slope):
        x_vals = merged_df_copd['Median AQI']
        y_vals_regression = slope * x_vals + intercept
        ax.plot(x_vals, y_vals_regression, color='red', label='Regression Line')
        ax.text(0.7, 0.1, f'Slope: {slope:.2f}', transform=ax.transAxes, fontsize=12, bbox=dict(facecolor='white', edgecolor='gray', boxstyle='round,pad=0.5'))
    ax.set_xlabel('Median AQI')
    ax.set_ylabel('Percent COPD')
    ax.set_title(f'{year} - Median AQI vs. COPD Prevalence')
    plt.show()